---
title: "Text classification: Detection of spam messages"
output: 
  word_document: default
  html_notebook: default
---

### About the data set

The dataset used in this example is a preprocessed subset of the [Ling-Spam Dataset](http://csmining.org/index.php/ling-spam-datasets.html). It is based on 960 real email 
messages from a linguistics mailing list. 
The dataset was originally prepared for the Machine Learning course taught by Stanford Prof. Andrew Ng; it was downloaded from: http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex6/ex6.html

The dataset is split into two subsets: a 700-email subset for training and a 260-email subset for testing; each subset contains 50% spam and 50% nonspam messages. The data is stored as .txt files (one email message per file) organized into 4 directories: spam-train, nonspam-train, spam-test, and nonspam-test. These directories are available in the data/emails directory within this project.  

### Load the required R packages and utility functions
```{r message=FALSE}
library(tm)
library(caret)
library(class) # for kNN classifier

source("text_mining_utils.R")
```


### Loading the training and test sets

We'll start by loading the data that will be used for the training and testing of the classifier. 
First, get full names of the files (with email messages) to be used for the training set:  
```{r}
spam.train.files <- list.files(path = "data/emails/spam-train", full.names = TRUE)
#spam.train.files[1:10]

nonspam.train.files <- list.files(path = "data/emails/nonspam-train", full.names = TRUE)
# nonspam.train.files[1:10]
```

Create a data frame for the training data; the data frame will have 3 columns (variables):

* message file path
* message label (spam / nonspam)
* message content (initially NA)
```{r}
train.data <- data.frame(fpath = c(spam.train.files, nonspam.train.files),
                         label = c(rep("spam", times=length(spam.train.files)),
                                   rep("nonspam", times=length(nonspam.train.files))),
                         text = NA, stringsAsFactors = FALSE)
str(train.data)
```

Read text from the training set of spam and nonspam messages, and use it to populate the *text* variable of the *train.data* dataframe:
```{r warning=FALSE}
for(i in seq_len(nrow(train.data))) {
  train.data$text[i] <- read.text(train.data$fpath[i]) 
}
# head(train.data)
```
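
The `read.text()` helper is defined in *text_mining_utils.R*, which is not shown in this notebook. As a rough idea of what such a helper might do, here is a minimal sketch; `read.text.sketch` is a hypothetical name, and the actual implementation may differ (for example, in how it handles encoding or empty lines):

```{r}
# Hypothetical stand-in for read.text(): read all lines of a file
# and join them into a single character string
read.text.sketch <- function(fpath) {
  lines <- readLines(fpath, warn = FALSE)
  paste(lines, collapse = " ")
}

# Try it on a temporary file with two lines of made-up text
tmp.file <- tempfile(fileext = ".txt")
writeLines(c("subject conference announcement", "call paper linguistics"), tmp.file)
sketch.text <- read.text.sketch(tmp.file)
sketch.text
unlink(tmp.file)
```

The sketch returns one string per file, which is what the loop above needs in order to fill one cell of the *text* column at a time.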

Next, transform the *label* variable into a factor:
```{r}
train.data$label <- as.factor(train.data$label)
summary(train.data$label)
```

Now, load the test data and build a data frame that will be used for testing the classifier.
The procedure is the same as the one we followed for the training data, so we'll do it all in one chunk:
```{r warning=FALSE}
spam.test.files <- list.files(path = "data/emails/spam-test", full.names = TRUE)
nonspam.test.files <- list.files(path = "data/emails/nonspam-test", full.names = TRUE)

test.data <- data.frame(fpath = c(spam.test.files, nonspam.test.files),
                         label = c(rep("spam", times=length(spam.test.files)),
                                   rep("nonspam", times=length(nonspam.test.files))),
                         text = NA, stringsAsFactors = FALSE)

for(i in seq_len(nrow(test.data))) {
  test.data$text[i] <- read.text(test.data$fpath[i]) 
}

test.data$label <- as.factor(test.data$label)

# head(test.data)
```

### Data preparation (text preprocessing)

The loaded text is already pre-processed: 

* stop-words have been removed
* numbers and punctuation have been removed 
* text has been converted to lower case
* it has also been lemmatized
* all white spaces (tabs, newlines, spaces) have been trimmed to a single space character.

So, there is no need to preprocess it here. We just need to create a corpus and then a Document Term Matrix.

#### Create a corpus

We will create the corpus using both training and testing data sets; later, when building a classifier, we will, of course, use just the training portion of the corpus.
So, let's first merge the two data sets:
```{r}
all.data <- rbind(train.data, test.data)
dim(all.data)
```

We have 960 instances in total: the first 700 are for training, the rest (260) for testing. 

Now, we can create the corpus:
```{r}
corpus <- Corpus(VectorSource(all.data$text)) 
```

#### Create a Document Term Matrix

Next, we create a Document Term Matrix (DTM) using words of length 2+ and the TF-IDF weighting scheme (instead of the default term frequency (TF) weighting):
```{r}
dtm <- DocumentTermMatrix(x = corpus,
                          control = list(wordLengths = c(2,Inf),
                                         weighting = weightTfIdf))
inspect(dtm)
```
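
To see what TF-IDF weighting does, here is a small base R sketch on a made-up count matrix. It assumes the variant that, to our understanding of the tm documentation, `weightTfIdf` applies by default: term frequency normalised by document length, multiplied by a base-2 logarithm of the inverse document frequency. All `toy.*` names and values are invented for illustration:

```{r}
# Toy term-count matrix: 3 documents (rows) x 3 terms (columns)
toy.counts <- matrix(c(2, 1, 0,
                       0, 1, 1,
                       1, 1, 0),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(c("doc1", "doc2", "doc3"),
                                     c("free", "paper", "language")))

# Term frequency, normalised by document length (row sums)
toy.tf <- toy.counts / rowSums(toy.counts)

# Inverse document frequency:
# log2(number of documents / number of documents containing the term)
toy.df <- colSums(toy.counts > 0)
toy.idf <- log2(nrow(toy.counts) / toy.df)

# TF-IDF: multiply each column of the TF matrix by that term's IDF
toy.tfidf <- sweep(toy.tf, 2, toy.idf, `*`)
round(toy.tfidf, 3)
```

Note that the term occurring in every document ("paper") gets a weight of zero, while the term confined to a single document ("language") gets the highest weight.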

We have a very large number of words (almost 23K) and a very sparse DTM (99%). So, we should remove the sparse terms (here, those absent from at least 97.5% of the documents):
```{r}
dtm.reduced <- removeSparseTerms(dtm, sparse = 0.975)
dtm.reduced
```
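
The effect of the `sparse` threshold can be illustrated with a base R sketch on a made-up matrix. It mimics, under our reading of the tm documentation, the rule that a term is kept only if the share of documents in which it does not occur is below the threshold; with 960 documents and `sparse = 0.975`, that means terms occurring in more than 24 documents. The `toy.*` names, the counts, and the 0.7 threshold are invented for illustration:

```{r}
# Toy DTM: 8 documents x 3 terms
toy.dtm <- matrix(0, nrow = 8, ncol = 3,
                  dimnames = list(NULL, c("common", "occasional", "rare")))
toy.dtm[1:6, "common"] <- 1      # present in 6 of 8 documents
toy.dtm[1:3, "occasional"] <- 2  # present in 3 of 8 documents
toy.dtm[1, "rare"] <- 5          # present in 1 of 8 documents

# Share of documents in which each term does NOT occur
toy.sparsity <- colMeans(toy.dtm == 0)
toy.sparsity

# Keep only the terms whose sparsity is below the chosen threshold
toy.threshold <- 0.7
toy.kept <- toy.dtm[, toy.sparsity < toy.threshold, drop = FALSE]
colnames(toy.kept)
```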

This looks better: 1268 words are preserved out of the original set of 22,744 words (~6%); the overall sparsity is reduced to 93%, and the maximum term length is reduced to 15 characters (from 74).

Since we want to use DTM for classification purposes, we need to transform it into a 'simple' matrix that can be passed to a function for building a classifier:
```{r}
dtm.final <- as.matrix(dtm.reduced)
dim(dtm.final)
```

Split the matrix into training and test parts:
```{r}
training.dtm <- dtm.final[1:nrow(train.data),]
test.dtm <- dtm.final[(nrow(train.data)+1):nrow(dtm.final),]
```


### Create kNN classifier

To run kNN classification in R, we will use the knn() function from the *class* package (already loaded). We need to provide the knn() function with the following data:

* training data with no labels, 
* test data with no labels, 
* labels for the training set,
* number of neighbours to consider (parameter k)

Initially, we will simply guess the number of neighbours (k), and later on we will apply a more systematic approach to determine the best value for k.  
```{r}
k <- 5
train.labels <- train.data$label
set.seed(2612)
knn.fit <- knn(train = training.dtm,
               test = test.dtm,
               cl = train.labels,
               k = k)
```
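
To make the roles of the four arguments concrete, here is a minimal sketch of knn() on a tiny, made-up two-dimensional data set (all `toy.*` names and coordinates are invented for illustration):

```{r}
library(class)

# Made-up training points: three near the origin ("nonspam"),
# three around (5, 5) ("spam")
toy.train <- matrix(c(0, 0,
                      0, 1,
                      1, 0,
                      5, 5,
                      5, 6,
                      6, 5),
                    ncol = 2, byrow = TRUE)
toy.train.labels <- factor(c("nonspam", "nonspam", "nonspam",
                             "spam", "spam", "spam"))

# Two new points to classify, one close to each group
toy.test <- matrix(c(0.5, 0.5,
                     5.5, 5.5),
                   ncol = 2, byrow = TRUE)

# Each test point gets the majority label among its 3 nearest training points
toy.pred <- knn(train = toy.train, test = toy.test,
                cl = toy.train.labels, k = 3)
toy.pred
```

The first test point is labelled nonspam and the second spam, since in each case all three nearest training points belong to that group.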

Create the confusion matrix:
```{r}
test.labels <- test.data$label
conf.mat <- table(Predictions = knn.fit, 
                  Actual = test.labels)
conf.mat
```

Out of 260 observations in the test set, we have 16 false negatives (FN) and 3 false positives (FP). In this case, FPs are emails that are not spam but were classified as spam, whereas FNs are spam emails that were classified as not spam. Losing a legitimate message to the Spam folder is usually more costly than letting an occasional spam message into the Inbox, so we should try to reduce FPs, even at the expense of (slightly) increasing FNs; in other words, we should be ready to trade some recall for higher precision. 

Compute evaluation measures:
```{r}
knn1.eval <- compute.eval.measures(conf.mat)
knn1.eval
```
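
The compute.eval.measures() helper comes from *text_mining_utils.R*, which is not shown here, so its exact output is not reproduced. As a rough sketch of how such measures can be derived, the chunk below recomputes them from the counts reported above, assuming that spam is the positive class and that the test set holds 130 spam and 130 nonspam messages (50% of 260 each). The variable names are hypothetical, and the set of measures (accuracy, precision, recall, F1) is an assumption about what the helper returns:

```{r}
# Counts implied by the text above, with "spam" as the positive class
FN.count <- 16              # spam classified as nonspam
FP.count <- 3               # nonspam classified as spam
TP.count <- 130 - FN.count  # spam correctly classified as spam
TN.count <- 130 - FP.count  # nonspam correctly classified as nonspam

sketch.accuracy  <- (TP.count + TN.count) /
                    (TP.count + TN.count + FP.count + FN.count)
sketch.precision <- TP.count / (TP.count + FP.count)
sketch.recall    <- TP.count / (TP.count + FN.count)
sketch.F1 <- 2 * sketch.precision * sketch.recall /
             (sketch.precision + sketch.recall)

round(c(accuracy = sketch.accuracy, precision = sketch.precision,
        recall = sketch.recall, F1 = sketch.F1), 4)
```

With only 3 FPs, precision is noticeably higher than recall, which is the direction we want for a spam filter.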

Now, instead of guessing, we'll cross-validate kNN models with different values for k, and see which value of k gives the best performance. Then, we'll use the test set to test the model that proves to be the best.

The caret package will be used to find the optimal parameter (k) value through cross validation.
First, define cross-validation (cv) parameters; we'll do 10-fold cross-validation:
```{r}
numFolds <- trainControl(method = "cv", number = 10)
```
Then, define the range of k values to examine in the cross-validation. We'll take odd numbers between 3 and 25; recall that in binary classification it is recommended to choose an odd number for k, since this avoids ties in the majority vote.
```{r}
kGrid <- expand.grid(.k = seq(from = 3, to = 25, by = 2))
```

Train the model through cross-validation
```{r}
set.seed(2612)
knn.cv <- train(x = training.dtm, 
                y = train.labels, 
                method = "knn", 
                trControl = numFolds, 
                tuneGrid = kGrid)
```
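
To show what cross-validating over k involves, here is a hand-written sketch on a small binary problem taken from the built-in iris data. It is not how caret implements train(): caret assigns folds at random, whereas this sketch assigns them by cycling through the fold numbers so that it stays deterministic. The `toy.*` names, the 5 folds, and the k range are choices made for illustration only:

```{r}
library(class)

# Binary toy problem: versicolor vs. virginica (50 observations each)
toy.iris <- droplevels(iris[iris$Species != "setosa", ])
toy.x <- as.matrix(toy.iris[, 1:4])
toy.y <- toy.iris$Species

# Assign observations to 5 folds by cycling through 1..5;
# since rows are ordered by class, each fold gets 10 cases of each class
toy.n.folds <- 5
toy.fold.id <- rep(seq_len(toy.n.folds), length.out = nrow(toy.x))

# For each candidate k, average the accuracy over the held-out folds
toy.k.values <- seq(from = 3, to = 9, by = 2)
toy.cv.accuracy <- sapply(toy.k.values, function(k.value) {
  fold.accuracy <- sapply(seq_len(toy.n.folds), function(fold) {
    held.out <- toy.fold.id == fold
    pred <- knn(train = toy.x[!held.out, ], test = toy.x[held.out, ],
                cl = toy.y[!held.out], k = k.value)
    mean(pred == toy.y[held.out])
  })
  mean(fold.accuracy)
})

data.frame(k = toy.k.values, accuracy = toy.cv.accuracy)
```

The value of k with the highest average accuracy would be the one to carry forward, which is the same selection that train() performs for us over the grid defined above.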

Examine the obtained results for different values of k:
```{r}
knn.cv
```

```{r}
plot(knn.cv)
```


For k=13, we get the best value for all the examined metrics. So, we choose k=13 as the number of neighbours.
```{r}
set.seed(2612) # knn() breaks ties at random, so fix the seed for reproducibility
knn.fit2 <- knn(train = training.dtm,
                test = test.dtm,
                cl = train.labels,
                k = 13)
```

```{r}
conf.mat2 <- table(Predicted = knn.fit2, Actual = test.labels)
conf.mat2
```
We reduced the number of FNs, but slightly increased the number of FPs.

```{r}
knn2.eval <- compute.eval.measures(conf.mat2)
knn2.eval
```

Compare the evaluation measures obtained for the two models:
```{r}
data.frame(rbind(knn1.eval, knn2.eval), row.names = c("k=5", "k=13"))
```
We've managed to improve all the metrics except for precision. 
